We use the following packages in this Practical:
library(dplyr) # for data manipulation
library(magrittr) # for pipes
library(ggplot2) # for visualization
library(mice) # for the boys data
for-loopsfor-loop that loops over all numbers between 0 and 10, but only prints numbers below 5. for (i in 0:10) {
if (i < 5) {
print(i)
}
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
for (i in 0:10) {
if (i >= 3 & i <= 5) {
print(i)
}
}
## [1] 3
## [1] 4
## [1] 5
Or, even more efficiently:
for (i in 0:10) {
if (i %in% 3:5) {
print(i)
}
}
## [1] 3
## [1] 4
## [1] 5
byrow = TRUE to fill a matrix left-to-right instead of top-to-bottom.## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 2 3 4 5 6 7 8
## [2,] 2 4 6 8 10 12 14 16
## [3,] 3 6 9 12 15 18 21 24
## [4,] 4 8 12 16 20 24 28 32
## [5,] 5 10 15 20 25 30 35 40
# Create a matrix with 1 to 8.
mat <- matrix(1:8, ncol=8, nrow=5, byrow = TRUE)
# Loop over each row, and multiply it.
for (i in 1:5) {
mat[i, ] <- mat[i, ] * i
}
mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 2 3 4 5 6 7 8
## [2,] 2 4 6 8 10 12 14 16
## [3,] 3 6 9 12 15 18 21 24
## [4,] 4 8 12 16 20 24 28 32
## [5,] 5 10 15 20 25 30 35 40
The anscombe data set is a wonderful data set from 1973 by Francis J. Anscombe aimed to demonstrate that pairs of variables can have the same statistical properties, while having completely differnt graphical representations. We will be using this data set more this week. If you’d like to know more about anscombe, you can simply call ?anscombe to enter the help.
You can directly call anscombe from your console because the datasets package is a base package in R. This means that it is always included and loaded when you start an R instance. In general, when you would like to access functions or data sets from packages that are not automatically loaded, we don’t have to explicitly load the package. We can also call package::thing-we-need to directly ‘grab’ the thing-we-need from the package namespace. For example,
test <- datasets::anscombe
identical(test, anscombe) #test if identical
## [1] TRUE
This is especially handy within functions, as we can call package::function-name to borrow functionality from installed packages, without loading the whole package. Calling only those functions that you need is more memory-efficient than loading it all. More memory efficient means faster computation.
summary) for each column of the anscombe dataset from the datasets package# Using i as an indicator for the current column.
for (i in 1:ncol(anscombe)) {
print(colnames(anscombe)[i])
print(summary(anscombe[, i]))
}
## [1] "x1"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## [1] "x2"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## [1] "x3"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## [1] "x4"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 8 8 9 8 19
## [1] "y1"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.260 6.315 7.580 7.501 8.570 10.840
## [1] "y2"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.100 6.695 8.140 7.501 8.950 9.260
## [1] "y3"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.39 6.25 7.11 7.50 7.98 12.74
## [1] "y4"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.250 6.170 7.040 7.501 8.190 12.500
We could also loop over the variables directly. Although the code is then a bit more clear, this does imply that we can not access the names of the variables. The resulting output is therefore less informative.
for (i in anscombe) {
print(summary(i))
}
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8 8 8 9 8 19
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.260 6.315 7.580 7.501 8.570 10.840
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.100 6.695 8.140 7.501 8.950 9.260
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.39 6.25 7.11 7.50 7.98 12.74
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.250 6.170 7.040 7.501 8.190 12.500
apply()anscombe dataset using apply().apply(X = anscombe, MARGIN = 2, FUN = summary)
## x1 x2 x3 x4 y1 y2 y3 y4
## Min. 4.0 4.0 4.0 8 4.260000 3.100000 5.39 5.250000
## 1st Qu. 6.5 6.5 6.5 8 6.315000 6.695000 6.25 6.170000
## Median 9.0 9.0 9.0 8 7.580000 8.140000 7.11 7.040000
## Mean 9.0 9.0 9.0 9 7.500909 7.500909 7.50 7.500909
## 3rd Qu. 11.5 11.5 11.5 8 8.570000 8.950000 7.98 8.190000
## Max. 14.0 14.0 14.0 19 10.840000 9.260000 12.74 12.500000
Remember that in R, the first indicator in square brackets always indicates the row, and the second indicator always indicates the column, such that anscombe[2, 3] would give us the value for the intersection of the second row and the third column. The same rationale translates to the margins we would like apply() to iterate over. The argument MARGIN = 2 specifies the columns, while MARGIN = 1 would indicate that a function should be applied over the rows:
apply(anscombe, 1, summary)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## Min. 6.5800 5.7600 7.58000 7.11000 7.81000 7.0400 5.2500 3.10000
## 1st Qu. 7.8650 6.9050 7.92750 8.57750 8.24750 8.0750 6.0000 4.00000
## Median 8.5900 8.0000 10.74000 8.82500 8.86500 9.4000 6.0400 4.13000
## Mean 8.6525 7.4525 10.47125 8.56625 9.35875 10.4925 6.3375 7.03125
## 3rd Qu. 10.0000 8.0000 13.00000 9.00000 11.00000 14.0000 6.4075 7.16750
## Max. 10.0000 8.1400 13.00000 9.00000 11.00000 14.0000 8.0000 19.00000
## [,9] [,10] [,11]
## Min. 5.5600 4.82000 4.740
## 1st Qu. 8.1125 6.85500 5.000
## Median 9.9850 7.00000 5.340
## Mean 9.7100 6.92625 5.755
## 3rd Qu. 12.0000 7.42250 6.020
## Max. 12.0000 8.00000 8.000
We now see a returned matrix of 11 columns, that give us the summary() over the 11 rows in the anscombe data set.
dim(anscombe)
## [1] 11 8
anscombe dataset using sapply(). sapply(anscombe, summary)
## x1 x2 x3 x4 y1 y2 y3 y4
## Min. 4.0 4.0 4.0 8 4.260000 3.100000 5.39 5.250000
## 1st Qu. 6.5 6.5 6.5 8 6.315000 6.695000 6.25 6.170000
## Median 9.0 9.0 9.0 8 7.580000 8.140000 7.11 7.040000
## Mean 9.0 9.0 9.0 9 7.500909 7.500909 7.50 7.500909
## 3rd Qu. 11.5 11.5 11.5 8 8.570000 8.950000 7.98 8.190000
## Max. 14.0 14.0 14.0 19 10.840000 9.260000 12.74 12.500000
We can see that sapply() returns a matrix. We don’t have to specify any margins as the anscombe data set is of class data.frame:
class(anscombe)
## [1] "data.frame"
Objects of class data.frame can be addressed as a list, where the columns are the listed elements (see Lecture A). The function summary() will automatically be applied over the listed elements.
mean = 5 and sd = 1 - \(N(5, 1)\),rnorm(1000, 5) %>%
matrix(ncol = 2) %>%
plot()
anscombe data setanscombe %>%
cor()
## x1 x2 x3 x4 y1 y2
## x1 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x2 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x3 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x4 -0.5000000 -0.5000000 -0.5000000 1.0000000 -0.5290927 -0.7184365
## y1 0.8164205 0.8164205 0.8164205 -0.5290927 1.0000000 0.7500054
## y2 0.8162365 0.8162365 0.8162365 -0.7184365 0.7500054 1.0000000
## y3 0.8162867 0.8162867 0.8162867 -0.3446610 0.4687167 0.5879193
## y4 -0.3140467 -0.3140467 -0.3140467 0.8165214 -0.4891162 -0.4780949
## y3 y4
## x1 0.8162867 -0.3140467
## x2 0.8162867 -0.3140467
## x3 0.8162867 -0.3140467
## x4 -0.3446610 0.8165214
## y1 0.4687167 -0.4891162
## y2 0.5879193 -0.4780949
## y3 1.0000000 -0.1554718
## y4 -0.1554718 1.0000000
x4, y4) on the anscombe data setUsing the standard %>% pipe:
anscombe %>%
subset(select = c(x4, y4)) %>%
cor()
## x4 y4
## x4 1.0000000 0.8165214
## y4 0.8165214 1.0000000
Alternatively, we can use the %$% pipe from package magrittr to make this process much more efficient.
anscombe %$%
cor(x4, y4)
## [1] 0.8165214
The boys dataset is part of package mice. It is a subset of 748 Dutch boystaken from the Fourth Dutch Growth Study. It’s columns record a variety of growth measures. Inspect the help for boys dataset and make yourself familiar with its contents.**
To learn more about the contents of the data, use one of the two following help commands:
help(boys)
?boys
boys data are sorted based on age. Verify this.To verify if the data are indeed sorted, we can run the following command to test the complement of that statement. Remember that we can always search the help for functions: e.g. we could have searched here for ?sort and we would quickly have ended up at function is.unsorted() as it tests whether an object is not sorted.
is.unsorted(boys$age)
## [1] FALSE
Or with a pipe:
boys %$%
is.unsorted(age)
## [1] FALSE
which returns FALSE, indicating that boys’ age is indeed sorted (we asked if it was unsorted!). To directly test if it is sorted, we could have used
!is.unsorted(boys$age)
## [1] TRUE
boys %$%
!is.unsorted(age)
## [1] TRUE
which tests if data data are not unsorted. In other words the values TRUE and FALSE under is.unsorted() turn into FALSE and TRUE under !is.unsorted(), respectively.
hgt and wgt in the boys data set from package mice.Because boys has missings values for almost all variables, we must first select wgt and hgt and then omit the rows that have missing values, before we can calculate the correlation. Using the standard %>% pipe, this would look like:
boys %>%
subset(select = c("wgt", "hgt")) %>%
cor(use = "pairwise.complete.obs")
## wgt hgt
## wgt 1.0000000 0.9428906
## hgt 0.9428906 1.0000000
which is equivalent to
boys %>%
subset(select = c("wgt", "hgt")) %>%
na.omit() %>%
cor()
## wgt hgt
## wgt 1.0000000 0.9428906
## hgt 0.9428906 1.0000000
Alternatively, we can use the %$% pipe:
boys %$%
cor(hgt, wgt, use = "pairwise.complete.obs")
## [1] 0.9428906
The %$% pipe exposes the listed dimensions of the boys dataset, such that we can refer to them directly.
boys data set, hgt is recorded in centimeters. Use a pipe to transform hgt in the boys dataset to height in meters and verify the transformationUsing the standard %>% and the %$% pipes:
boys %>%
transform(hgt = hgt / 100) %$%
mean(hgt, na.rm = TRUE)
## [1] 1.321518
Alternatively, we could also use the mutate() function.
boys %>%
mutate(hgt = hgt / 100) %$%
mean(hgt, na.rm = TRUE)
## [1] 1.321518
The mutate() function is more flexible than transform() as it allows for more complex data manipulation other than mere transformation.
hgt, wgt) two times: once for hgt in meters and once for hgt in centimeters. Make the points in the ‘centimeter’ plot red and in the ‘meter’ plot blue. This is best done with the %T>% pipe:
boys %>%
subset(select = c(hgt, wgt)) %T>%
plot(col = "red", main = "Height in centimeters") %>%
transform(hgt = hgt / 100) %>%
plot(col = "blue", main = "Height in meters")
The %T>% pipe is very useful, because it creates a literal T junction in the pipe. It is perhaps most informative to graphically represent the above pipe as follows:
boys %>%
subset(select = c(hgt, wgt)) %T>%
plot(col = "red", main = "Height in centimeters") %>%
transform(hgt = hgt / 100) %>%
plot(col = "blue", main = "Height in meters")
We can see that there is indeed a literal T-junction. Naturally, we can expand this process with more %T>% pipes. However, once a pipe gets too long or too complicated, it is perhaps more useful to cut the piped problem into smaller, manageble pieces.
plot() is the core plotting function in R. Find out more about plot(): Try both the help in the help-pane and ?plot in the console. Look at the examples by running example(plot).The help tells you all about a functions arguments (the input you can specify), as well as the element the function returns to the Global Environment. There are strict rules for publishing packages in R. For your packages to appear on the Comprehensive R Archive Network (CRAN), a rigorous series of checks have to be passed. As a result, all user-level components (functions, datasets, elements) that are published, have an acompanying documentation that elaborates how the function should be used, what can be expected, or what type of information a data set contains. Help files often contain example code that can be run to demonstrate the workings.
?plot
example(plot)
##
## plot> require(stats) # for lowess, rpois, rnorm
##
## plot> plot(cars)
##
## plot> lines(lowess(cars))
##
## plot> plot(sin, -pi, 2*pi) # see ?plot.function
##
## plot> ## Discrete Distribution Plot:
## plot> plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
## plot+ main = "rpois(100, lambda = 5)")
##
## plot> ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
## plot> plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
##
## plot> points(x, cex = .5, col = "dark red")
There are many more functions that can plot specific types of plots. For example, function hist() plots histograms, but falls back on the basic plot() function. Packages lattice and ggplot2 are excellent packages to use for complex plots. Pretty much any type of plot can be made in R. A good reference for packages lattice that provides all R-code can be found at http://lmdvr.r-forge.r-project.org/figures/figures.html. Alternatively, all ggplot2 documentation can be found at http://docs.ggplot2.org/current/
age and bmi in the mice::boys data setWith the standard plotting device in R:
mice::boys %$% plot(bmi ~ age)
or, with ggplot2:
p <- ggplot(mice::boys, aes(age, bmi))
p + geom_point()
## Warning: Removed 21 rows containing missing values (geom_point).
Package ggplot2 offers far greater flexibility in data visualization than the standard plotting devices in R. However, it has its own language, which allows you to easily expand graphs with additional commands. To make these expansions or layers clearly visible, it is advisable to use the plotting language conventions. For example,
mice::boys %>%
ggplot(aes(age, bmi)) +
geom_point()
would yield the same plot as
ggplot(mice::boys, aes(age, bmi)) + geom_point()
but the latter style may be less informative, especially if more customization takes place and if you share your code with others.
bmi < 18.5 use color = "light blue"bmi > 18.5 & bmi < 25 use color = "light green"bmi > 25 & bmi < 30 use color = "orange"bmi > 30 use color = "red"Hint: it may help to expand the data set with a new variable.
It may be easier to create a new variable that creates the specified categories. We can use the cut() function to do this quickly
boys2 <-
boys %>%
mutate(class = cut(bmi, c(0, 18.5, 25, 30, Inf),
labels = c("underweight",
"healthy",
"overweight",
"obese")))
by specifying the boundaries of the intervals. In this case we obtain 4 intervals: 0-18.5, 18.5-25, 25-30 and 30-Inf. We used the %>% pipe to work with bmi directly. Alternatively, we could have done this without a pipe:
boys3 <- boys
boys3$class <- cut(boys$bmi, c(0, 18.5, 25, 30, Inf),
labels = c("underweight",
"healthy",
"overweight",
"obese"))
to obtain the same result.
With the standard plotting device in R we can now specify:
plot(bmi ~ age, subset = class == "underweight", col = "light blue", data = boys2,
ylim = c(10, 35), xlim = c(0, 25))
points(bmi ~ age, subset = class == "healthy", col = "light green", data = boys2)
points(bmi ~ age, subset = class == "overweight", col = "orange", data = boys2)
points(bmi ~ age, subset = class == "obese", col = "red", data = boys2)
and with ggplot2 we can call
boys2 %>%
ggplot() +
geom_point(aes(age, bmi, col = class))
## Warning: Removed 21 rows containing missing values (geom_point).
Although the different classifications have different colours, the colours are not conform the specifications of this exercise. We can manually override this:
boys2 %>%
ggplot() +
geom_point(aes(age, bmi, col = class)) +
scale_color_manual(values = c("light blue", "light green", "orange", "red"))
## Warning: Removed 21 rows containing missing values (geom_point).
Because there are missing values, ggplot2 displays a warning message. If we would like to not consider the missing values when plotting, we can simply exclude the NAs by using a filter():
boys2 %>%
filter(!is.na(class)) %>%
ggplot() +
geom_point(aes(age, bmi, col = class)) +
scale_color_manual(values = c("light blue", "light green", "orange", "red"))
Specifying a filter on the feature class is sufficient: age has no missings and the missings in class directly correspond to missing values on bmi. Filtering on bmi would therefore yield an identical plot.
age in the boys data setWith the standard plotting device in R:
boys %$%
hist(age, breaks = 50)
The breaks = 50 overrides the default breaks between the bars. By default the plot would be
boys %$%
hist(age)
Using a pipe is a nice approach for this plot because it inherits the names of the objects we aim to plot. Without the pipe we might need to adjust the main title for the histogram:
hist(boys$age, breaks = 50)
With ggplot2:
boys %>%
ggplot() +
geom_histogram(aes(age), binwidth = .4)
Please note that the plots from geom_histogram() and hist use different calculations for the bars (bins) and hence may look slightly different.
reg in the boys data setWith a standard plotting device in R:
boys %$%
table(reg) %>%
barplot()
With ggplot2:
boys %>%
ggplot() +
geom_bar(aes(reg))
Note that geom_bar by default plots the NA’s, while barplot() omits the NA’s without warning. If we would not like to plot the NAs, then a simple filter() (see exercise 2) on the boys data is efficient.
hgt with different boxes for reg in the boys data setWith a standard plotting device in R:
boys %$%
boxplot(hgt ~ reg)
With ggplot2:
boys %>%
ggplot(aes(reg, hgt)) +
geom_boxplot()
## Warning: Removed 20 rows containing non-finite values (stat_boxplot).
age with different curves for boys from the city and boys from rural areas (!city).With a standard plotting device in R:
d1 <- boys %>%
subset(reg == "city") %$%
density(age)
d2 <- boys %>%
subset(reg != "city") %$%
density(age)
plot(d1, col = "red", ylim = c(0, .08))
lines(d2, col = "blue")
The above plot can also be generated without pipes, but results in an ugly main title. You may edit the title via the main argument in the plot() function.
plot(density(boys$age[!is.na(boys$reg) & boys$reg == "city"]),
col = "red",
ylim = c(0, .08))
lines(density(boys$age[!is.na(boys$reg) & boys$reg != "city"]),
col = "blue")
With ggplot2 everything looks much nicer:
boys %>%
mutate(area = ifelse(reg == "city", "city", "rural")) %>%
filter(!is.na(area)) %>%
ggplot(aes(age, fill = area)) +
geom_density(alpha = .3) # some transparency
hgt in the boys data set, that displays for every age year that year’s mean height in deviations from the overall average hgtLet’s not make things too complicated and just focus on ggplot2:
boys %>%
mutate(Hgt = hgt - mean(hgt, na.rm = TRUE),
Age = cut(age, 0:22, labels = 0:21)) %>%
aggregate(Hgt ~ Age, data = ., mean) %>% #specify data = . to allow formula
mutate(Diff = cut(Hgt, c(-Inf, 0, Inf),
labels = c("Below Average", "Above Average"))) %>%
ggplot(aes(x = Age, y = Hgt, fill = Diff)) +
geom_bar(stat = "identity") +
coord_flip()
We can clearly see that the average height in the group is reached just before age 7. The aggregate() function is used to return the mean() of deviation Hgt for every group in Age.
For example, if we would like the mean height hgt for every region reg in the boys data, we could call:
boys %>%
aggregate(hgt ~ reg, data = ., FUN = mean)
## reg hgt
## 1 north 151.6316
## 2 east 133.9648
## 3 west 130.2783
## 4 south 128.0022
## 5 city 125.8577
We have to specify data = . in order to allow for the formula-style call to aggregate() - where the method is of class formula. However, the data set boys is parsed down the pipe as an object of class data.frame. The default evaluation of argument would therefore be
aggregate(x, by, FUN)
and in the pipe this is by default evaluated as
aggregate(., hgt ~ reg, mean)
where . is the object parsed down the pipe. This . is automatically evaluated as the first argument, unless otherwise specified by the user. In most cases this works because the data is usually the first argument that is evaluated in a function. However, the result is not a valid call to aggregate() because the object we’re parsing down the pipe has class data.frame. aggregate() would therefor try to run aggregate.data.frame(), but our code dictates an evaluation of the formula call. By assigning the . to data = ., we specifically call for a formula evaluation of aggregate(). This solves the mismatch and forces aggregate() to conform to class formula:
aggregate(formula, data, FUN)
which is evaluated as
aggregate(hgt ~ reg, data = ., mean)
in the pipe. Problem solved!
The specifics about calling functions and evaluating their arguments can always be found in the help. Try ?aggregate to see all forms this function’s call may take.
End of Practical
ggplot2 reference pagemagrittrR for Data Science - Chapter 18 on pipes